ABSTRACT

Descriptive captions help organize noncomputerized media. But automated use of captions in
retrieval from computerized multimedia databases has not been much examined because it would
seem to require significant natural language processing. We argue that captions can be naturally
expressed in a restricted language whose interpretation is easier than general natural-language
understanding. We describe a multimedia database system that stores interpreted captions in
predicate calculus for each media datum; it then interprets restricted-language queries, and
finds matching media objects. In exploring these ideas for two database applications, we have
recognized three important issues. (1) Using a caption does not require deep understanding of
it, just a comprehensive type hierarchy for concept types in it. (2) Captions can be accessed
faster than media data because they are much smaller. So to access media data, we should map
first to captions through a hash table. We argue that only nouns and verbs should be hashed, and
that additional pointers should link subtypes to types. A "coarse-grain" search can intersect
hash-table lists to find a candidate set of captions for a query; a "fine-grain" search can then
carefully attempt matching the query to each, with variable binding, etc. (3) "Supercaptions",
describing sets of other captions, can minimize caption redundancy; supercaptions can be of
other supercaptions, etc. Pointers to supercaptions simplify the hash table. But now there is a
conflict between exploring subcaptions and exploring supertypes of an entry in the hash table;
we propose concurrent processing to solve this.

1. Introduction

Descriptive captions have long been valuable in organizing and retrieving from multimedia data.
Examples include English descriptions below newspaper photos, titles on slides, record jackets,
and labels on videos. Captions can focus on only the important things in a media datum, but
unlike a keyword list, a caption can have a complex structure mirroring the structure of a media
datum. Although some multimedia database systems store captions, rarely have captions been used
to aid retrieval since that seems to require a big dictionary and complex natural-language
understanding routines. Nonetheless, we believe that software technology has now made sufficient
progress to use captions routinely to help retrieval from many multimedia databases.

For example, the Naval Weapons Center at China Lake, California keeps a library of 36,000 photographs. 
Since many of the photographs appear very similar, each photograph has an associated caption that is 
stored in a computer database, such as this for Figure 1: 

Air to air, TP87A209 Sidewinder AIM 9M test with F-15 aircraft USAF 82028 and F-16 aircraft
ASAF 83131 of 422nd Test and Evaluation Squadron. Full side views of both aircraft and
individually uploaded with missiles. Excellent. LHL 253149, 51 and 52 released S. B. Oster
Pao, 5/29/89.

Notice that this language is considerably more formal than everyday English, and thus is not as
difficult to parse and interpret. However, the Naval Weapons Center currently uses such caption
text only as a source of manually selected keywords which are matched to keywords supplied by a
user. Clearly, much valuable information in captions is being ignored, like the information in
the above caption about the relationship of the aircraft to the squadron, the missiles to the
aircraft, and the person to the rest of the caption. So keyword-based retrieval specifying
"Sidewinder" and "aircraft" could also mistakenly find pictures of damage done to aircraft by
the Sidewinder. Furthermore, some reasoning about a caption is necessary to match it, since many
things are implied but not stated directly, like that a Sidewinder is a missile, "AIM 9M" is the
version code of a missile, and that "excellent" refers to a clarity scale for photographs. So
keyword-based retrieval specifying "Sidewinder", "test", and "excellent" to find excellent test
results could mistakenly retrieve Fig. 1.

We propose that arbitrarily long unformatted captions, in part from natural language, be created
for every media datum in a multimedia database. Captions can also be created for sets of media
data (such as all pictures taken during a particular missile test), and be inherited by
subcaptions. Then queries to the media data could first check all this caption information. This
could save query processing time, because a caption can be stored in much less space than most
media data, and a faster storage could be used than that for the media data; then failure to
match a query to the caption eliminates the need to retrieve the media datum from slow storage.
And users could browse the caption data to decide what they are interested in before costly
media-data retrievals (keyword lists are not as helpful to browsers). To simplify matching, both
captions and queries can be parsed and interpreted, then represented as "meaning lists" of
semantic properties and relationships; this can be done long in advance for the captions.

Natural language (e.g. English) processing by computer has made slow but steady progress in
recent years, and it is becoming increasingly efficient while at the same time allowing
considerable subtlety of expression. Using natural language solves many of the ambiguity
problems in the relationship of words in keyword lists, improving the precision of query
matches. Thus we are including a natural-language processing component in our caption-oriented
multimedia database system. Understanding natural-language descriptions of the contents of
multimedia databases is usually a considerably simpler problem than that of general
natural-language understanding, since the universe of discourse is usually quite constrained.
Nouns tend to be concrete since they usually correspond to observables in the media data, and
quantifiers and other logical operators are rare since often the easiest way to describe a media
datum is to describe its separate pieces separately. And most multimedia databases emphasize
still photographs and other fixed-time graphics to which few verbs can be applied, and verbs are
one of the hardest aspects of natural language processing. But most importantly, we use natural
language only to access entities in a database, and complete understanding of the words is not
necessary for this goal. For instance, for the query "Air to air missiles mounted on aircraft",
it is unnecessary to know exactly what "missile" and "mounted" mean to match the query to the
above example caption, just that a Sidewinder is a missile and uploading is a kind of mounting.
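
The shallow type reasoning described above can be sketched in a few lines. This is only an
illustration under stated assumptions, not the system's implementation: the table `PARENTS`,
the functions `is_a` and `terms_match`, and the concept names are hypothetical, chosen to
mirror the Sidewinder example.

```python
# Hypothetical fragment of a concept hierarchy: each concept lists its supertypes.
PARENTS = {
    "sidewinder": ["air_to_air_missile"],
    "air_to_air_missile": ["missile"],
    "uploading": ["mounting"],
    "f15": ["aircraft"],
    "f16": ["aircraft"],
}


def is_a(concept, target):
    """Return True if `concept` is `target` or a (transitive) subtype of it."""
    stack, seen = [concept], set()
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(PARENTS.get(current, []))
    return False


def terms_match(query_terms, caption_terms):
    """Each query term must be covered by a caption term of the same type or a subtype."""
    return all(any(is_a(c, q) for c in caption_terms) for q in query_terms)
```

Here `terms_match(["missile", "mounting", "aircraft"], ["sidewinder", "uploading", "f15"])`
succeeds without any knowledge of what a missile or mounting actually is.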

Besides indexing by keywords, an alternative to caption matching is content analysis of media
data at query time, but this usually requires too much computational effort. There are some
exceptions, such as scanning a short block of text to find a particular name. But such purely
syntactic analysis is inflexible and its utility is limited for pictures, video, and audio for
which inferencing is often needed. For instance, we would have a hard time finding Fig. 1 for a
query asking for pictures of Sidewinder missiles: The missiles are small in the photograph and
easily confusable with the gas tanks hanging from the bottoms of the planes, and they cannot
easily be found in the picture until the plane outlines are found first by processing of the
entire picture. (A reasonable digital representation of this picture would be 500 by 500 bytes.)
And additional information must always supplement content analysis: The name of the squadron to
which the planes in Fig. 1 belong is not indicated in the photograph, nor the identity of what
is being tested.

2. Previous work 

Many researchers have worked on the problem of accessing multimedia data efficiently, although
we know of no one who has tried to use captions in the central way that we do, nor anyone who
has exploited captions on sets of captions. There is a variety of related research in
information retrieval, database design, and artificial intelligence, for which we can cite some
representative papers. Some researchers in information retrieval have investigated "semantic"
representations of retrieval objects instead of the standard keyword lists. Kolodner's
pioneering work embedded facts for retrieval in a complicated semantic network, and used a
variety of special heuristics suggested by human reasoning to intelligently search that network;
the primary concern was computer-generated explanations of text data, a more difficult problem
than ours. Cohen and Kjeldsen proposed spreading activation over a semantic network to find
qualitatively good associative matches. Rau in SCISOR proposed a two-stage retrieval process
from a semantic network in which the first stage was a spreading activation and the second was
matching between a subgraph and a graph; input was English questions, so a significant portion
of the implementation was devoted to natural-language processing and explanation of text data.
Smith et al in EP-X handled term-name differences between query and datum by using a hierarchy
of concepts, where all levels could have pointers to retrieval objects.

Researchers in database design have been increasingly interested in multimedia databases. Some
of this research concerns good ways of describing multimedia data for efficient retrieval, such
as the special summary data to describe pictures in Chang et al and the special parameters for
describing video in Nagel. Such descriptive information should be part of a good caption on the
media datum. Other research concerns efficient administration of a database system containing
multimedia objects, which can often be difficult because of their highly varied and highly
storage-intensive formats. Bertino et al and Roussopoulos et al exemplify this work, with an
emphasis on conceptual modeling and query languages.

A longtime concern of artificial intelligence has been manipulating descriptions of the world,
and many of its results apply to our problem. A variety of books address practical issues in
knowledge representation for artificial intelligence, such as Rowe. Grosz et al exemplify the
current state of natural-language processing tools, in presenting a powerful design tool for
creating natural-language parsers and interpreters for a wide variety of domains. Wilensky
provides an example of a powerful natural-language system that can be used to answer a wide
range of English questions about the UNIX operating system; its success suggests that
natural-language processing can be feasible and efficient for a surprisingly broad domain.

3. Overview of our caption-based access to multimedia data 

Fig. 2 shows a block diagram of the data structures in our caption-based approach to efficient
access of multimedia data, and Fig. 3 describes the blocks. Humans interact with our system at
two places, the top left and the top right corners; on the left, human experts supply media data
and their associated captions for storage in a multimedia database, and on the right, non-expert
humans query the data. The actual media data (which comprise the multimedia database) are stored
in a separate system on a separate processor, since media data generally require far more space
than the access data structures discussed here. We expect pictures will usually be the most
common form of media data, and each picture will be at least the complexity of a television
picture (500 by 500 bytes), and we have a target of one million media data items in this design,
so the multimedia database should be on the order of $10^{11}$ bytes ($10^6$ items at 250,000
bytes each). This number and the generally read-only nature of the media data strongly suggest
optical storage, which is slow for random access. Furthermore, multimedia data can come in many
different formats, suggesting an object-oriented database system. Previous work by B. Holtkamp,
V. Lum, and the first author proposed the details of it, and work is continuing on its
implementation (although that work also proposed including captions and registration information
in that database, we now think it a poor idea). Since the multimedia database will function
mostly independently of the caption-based retrieval, we will not discuss its details further in
this paper.

The main innovation of our design is the access to media data through semantically richer
information (meaning lists, that is, parsed and interpreted captions) instead of keywords.
Meaning lists are lists of predicate-calculus expressions giving the "meaning" of captions, and
are equivalent to semantic networks; Fig. 4 gives an example. Usually they can be written as
lists of literals because logical conjunction is usually the primary operator necessary: A
caption usually specifies the meaning of each part of a natural-language utterance, then
requires that the "and" of all these meaning components must hold. Variables in the arguments to
the literals can relate the parts of a caption description; in Fig. 4, the variables are the
codes consisting of a letter followed by a number. Methods for obtaining meaning lists are
described in section 4.1.
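
To make the data structure concrete, a meaning list can be pictured as a plain list of literals.
The sketch below is an assumption-laden illustration: the predicate names and the particular
literals are hypothetical and are not copied from Fig. 4; only the convention that variables are
a letter followed by a number comes from the text.

```python
import re

# Hypothetical meaning list for the caption "U.S. soldiers wading ashore ... off Morotai Island".
# Each literal is (predicate, arg1, ...); arguments like "s1" are variables.
MEANING_LIST = [
    ("soldier", "s1"),
    ("nationality", "s1", "us"),
    ("wading", "a1"),
    ("agent", "a1", "s1"),
    ("island", "p1"),
    ("name", "p1", "morotai"),
    ("near", "a1", "p1"),
]

# A variable is a single letter followed by one or more digits.
VARIABLE = re.compile(r"[a-z][0-9]+")


def is_variable(term):
    """Return True if `term` follows the letter-plus-number variable convention."""
    return VARIABLE.fullmatch(term) is not None


def variables_of(meaning_list):
    """Collect the variables that tie the literals of a meaning list together."""
    return {arg for literal in meaning_list for arg in literal[1:] if is_variable(arg)}
```

The shared variables are what make the list equivalent to a semantic network: `s1` links the
soldier to the wading action, and `p1` links the action to the island.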

Besides the captions themselves, our system requires auxiliary information from a lexicon, a
concept hierarchy for the domain, and frame recognition rules. The lexicon (or dictionary) is
necessary for parsing, and gives for each possible natural-language word its "meaning": its part
of speech, its grammatical forms, and the form of the literals needed to represent it. Ten
thousand words exclusive of proper nouns is a reasonable lexicon size for most applications.
Many of the hardest words to represent in a lexicon (for instance, conjunctions and quantifying
adjectives) are consistent in meaning across a wide range of domains, so we can just borrow
their interpretation from existing natural-language systems; the words that significantly change
between applications are the nouns and a few verbs, and their representation is more
straightforward. The concept hierarchy is a type hierarchy on the key concepts that can be
included in meaning lists. It has both upward pointers (for semantic checking after parsing of
natural language) and downward pointers (for finding captions with terms that are subtypes of
those in the query); there can be more than one upward pointer from a concept. Lastly, the
frame-recognition rules add generalization terms to meaning lists that reflect inferences beyond
what the natural language actually said, like the implied firing of the missiles in the first
caption of section 1.

The meaning lists for queries are used to find relevant media data in two phases, a coarse-grain
search and a fine-grain search. The coarse-grain search does hash-table lookup of all
occurrences of certain helpfully restrictive terms in the literals, those corresponding to nouns
and verbs in the original natural-language input. This gives a set of caption pointers to all
caption objects containing these identifying literals, and thus candidates for satisfying the
query. Then a fine-grain search matches the full query meaning list against the candidate
captions' meaning lists, binding variables as necessary.
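
The coarse-grain phase can be sketched as an inverted index whose lists are intersected. This is
a minimal sketch under assumptions: the caption identifiers, the `SUBTYPES` table, and the
function names are hypothetical, and a Python dictionary stands in for the hash table.

```python
from collections import defaultdict

# Hypothetical downward pointers of the concept hierarchy (type -> direct subtypes).
SUBTYPES = {"missile": ["sidewinder", "skipper"], "aircraft": ["f15", "f16", "a6e"]}

# Hypothetical noun and verb terms extracted from three caption meaning lists.
CAPTION_TERMS = {
    1: {"sidewinder", "test", "f15", "f16"},
    2: {"skipper", "a6e", "loader"},
    3: {"f16", "landing"},
}


def build_index(caption_terms):
    """Hash each term to the set of caption ids whose meaning lists contain it."""
    index = defaultdict(set)
    for caption_id, terms in caption_terms.items():
        for term in terms:
            index[term].add(caption_id)
    return index


def with_subtypes(term):
    """Return the term together with all of its transitive subtypes."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(SUBTYPES.get(current, []))
    return found


def coarse_search(index, query_terms):
    """Intersect, over the query terms, the captions mentioning the term or a subtype."""
    candidates = None
    for term in query_terms:
        matches = set()
        for variant in with_subtypes(term):
            matches |= index.get(variant, set())
        candidates = matches if candidates is None else candidates & matches
    return candidates or set()
```

Only the captions that survive this intersection need to be handed to the more expensive
fine-grain matcher.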

A million media data items means a million captions. We expect an average caption will take 100 bytes; 
captions should summarize, not exhaustively catalog. So the caption database will be about 100 megabytes 
uncompressed, and compression techniques can make it significantly smaller. Note in Fig. 2 that some of 
the caption database is allocated to supercaptions. These are captions that describe a class of media data, 
eliminating redundancy: Fig. 4 shows some example supercaption information. Supercaptions are an 
important part of our design, and are a more user-friendly way of modeling hierarchical structure in data 
than an index on keywords; section 4.3 will discuss them further. 
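
The size estimates in this section follow from simple arithmetic, reproduced here as a check.
The figures (one million items, 500 by 500 bytes per picture, 100 bytes per caption) are the
ones stated in the text; the variable names are ours.

```python
NUM_ITEMS = 1_000_000            # target number of media data items
BYTES_PER_PICTURE = 500 * 500    # television-quality picture, one byte per pixel
BYTES_PER_CAPTION = 100          # expected average caption size

media_bytes = NUM_ITEMS * BYTES_PER_PICTURE      # 250,000,000,000 bytes
caption_bytes = NUM_ITEMS * BYTES_PER_CAPTION    # 100,000,000 bytes, i.e. about 100 megabytes
ratio = media_bytes // caption_bytes             # media data are 2500 times larger
```

The factor of 2500 between the two stores is what makes it attractive to keep captions on fast
storage and touch the media data only after a caption has matched.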

We have applied our design to two important applications. We first built a prototype of some
portions of this design for the domain of the military history of U.S. forces in the Pacific in
World War II. We used media data of pictures digitized from published books about World War II,
about 100 photos in all, plus some aerial photographs of an army base. We used the captions
printed with those photos in the books, and some captions of our own for the aerial photographs,
and we wrote a reasonably general augmented-transition-network parsing and interpretation
routine in Prolog; in the latest version the parsing times, on test sentences averaging fifteen
words, are all less than ten seconds with uncompiled Prolog. The lexicon was 575 words, of which
227 were nouns and 46 were proper nouns. The meaning lists were then converted by code in C to
an SQL-like language that accessed an INGRES database. The hardware was a Sun workstation. With
the success of the prototype, we are now working with a significant existing database of
records, both historical and current, of projects at the Naval Weapons Center at China Lake,
California. Currently the database contains online captions of 36,000 photographs themselves
stored offline; we are putting the photographs and other media data online in an optical
jukebox. To demonstrate the generality of our methods, we will also be including in our database
the text of project reports, engineering drawings, viewgraphs for project presentations, video
of project tests, and audio of test pilot dialogues. Our intention is to provide a multimedia
database for proposal writing, public relations, and library purposes for the various
development projects at China Lake, but the methods employed should apply to any research
organization. The processing hardware will include a network of Sparc workstations. We intend to
continue to use Prolog for some parts of the design, but the natural-language processing will be
done by purchased software (see section 4.1).

4. Extraction of meaning lists 

For efficient retrieval it is important that we store the meaning-list representation of a
caption and not the caption itself: natural language processing of captions at query time would
enormously increase processing time. Following previous software development we use meaning
lists in Prolog linked-list format, lists of literals where most literals express properties or
binary relationships. To simplify matching, we are trying to limit the properties and
relationships to a small set of primitive properties and relationships; for instance, we will
not distinguish between the relationships asserted by the terms "within", "inside", "part of",
"containing", and "comprising". Again, to do efficient retrieval, it is not necessary that the
meaning lists capture the full meaning and implications of an English expression, just that they
express enough of the main intent to find obvious matchings.

4.1 Ways of obtaining meaning lists 

We are exploring three ways of obtaining meaning lists for captions and queries about captions,
each useful for certain kinds of information. One is a structured menu approach where we ask the
user a series of questions derived from a decision tree. For instance for the picture described
by the caption in Fig. 4, we could ask the user to look at the picture and give the main action;
then who is doing the main action; then if there is any action object; then whether there are
any modifiers that can describe the action (like adverbs); then if any adjective modifiers can
describe the subject noun; and so on. With this approach, parsing is simple. To save time, the
user can be asked to confirm default values (some words strongly imply others).

A second way of obtaining meaning-list information is by content analysis of the media data.
Although we dismissed this for use at query time in section 1, it could be used in setting up
the caption database if the analysis were not complex. For instance, we could compute the
predominant color in a picture or the grain size of the predominant texture. But it is much
easier to be told that a picture represents an F-16 aircraft than to try to trace the plane's
outline and then identify it.

Third, we can actually parse the restricted English of a sentence representing a caption or
query, and this is the friendliest approach for a user. Some powerful natural-language
understanding software is appearing. After a survey of what was available, we have begun using
DBG from Language Systems Inc. (Woodland Hills, California). We found its speed was reasonable
on test sentences. Its lexicon must be supplied in part by us; some of this information is the
type information we will discuss in section 4.2, and the rest is standard morphology (suffixes
and prefixes of words). Generally speaking, the most difficult words in English are multi-domain
multi-use words like conjunctions and prepositions, but their meanings do not vary much between
domains and their lexicon entries can be copied from existing lexicons. Additional lexicon
information can be obtained by structured menus addressed to the designer, as in the TEAM
Project.

We will allow only descriptive captions, as opposed to background. For instance: 

U.S. soldiers wading ashore in columns churn up the waters off Morotai Island, midway between
western New Guinea and the Philippines. MacArthur wanted Morotai so Allied aircraft could
operate from there and protect his Philippine landings. The Morotai invaders met no resistance.
(from R. Steinberg, World War II: Island Fighting, Time-Life Books, 1978)

Only the first half of the first sentence actually describes the photograph. This is a common
convention; for instance, in randomly selected articles of National Geographic we found in 110
out of 120 caption paragraphs that the first sentence was the only descriptive one. We will also
exclude captions whose associated pictures merely invoke a theme, such as a caption about the
Navy's budget for a picture of an aircraft carrier.

On the other hand, our Naval Weapons Center database has many multi-sentence captions in which
all sentences are descriptive, like the example of section 1. Frequently these captions
exemplify a kind of multi-sentence grammar where the sentences occur in a particular order. For
instance, this caption is typical of many in the database:

Skipper missile validation of A-6E aircraft loading check list. Closeup views of missile and MK
7 loader, and wire/electrical connections. LHL 226648 released D. Kline, 12/13/85.

First a testing action and its subject are described; then in a separate sentence, the focus and
nature of the photograph; then in a separate sentence, the authorization for release of the
photograph. The example of section 1 follows the same scheme. Thus a simple discourse grammar
can parse many of these captions to make interpretation even easier.
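
As a rough illustration of such a discourse grammar, the three-part ordering can be recognized
with a pattern for the release sentence and a split of what precedes it. This sketch is not the
parser used in the system: the regular expression, the field names, and the assumption that
every sentence ends in a period are ours.

```python
import re

# Final sentence: "LHL <numbers> released <person>, <date>."
RELEASE = re.compile(
    r"LHL\s+(?P<ids>.+?)\s+released\s+(?P<by>.+?),\s*(?P<date>\d+/\d+/\d+)\.?\s*$"
)


def parse_caption(text):
    """Split a caption into test action, view descriptions, and release authorization."""
    match = RELEASE.search(text)
    if match is None:
        return None  # caption does not follow the expected ordering
    body = text[:match.start()].strip()
    sentences = [s.strip() for s in re.split(r"(?<=\.)\s+", body) if s.strip()]
    if not sentences:
        return None
    return {
        "action": sentences[0].rstrip("."),
        "views": [s.rstrip(".") for s in sentences[1:]],
        "release": {
            "ids": match.group("ids"),
            "by": match.group("by"),
            "date": match.group("date"),
        },
    }
```

Each recognized segment could then be passed to a sentence-level parser that already knows
whether to expect an action, a view description, or an authorization.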

4.2 Conceptual generalizations: type hierarchies and frames for stereotypical actions 

To permit captions to be short, conceptual generalization on the contents of meaning lists must
be possible. Conceptual generalization can exploit three kinds of information: a concept
hierarchy, frames for domain stereotypes, and supercaptions. First, a complete and thorough type
hierarchy for the concepts (nouns and verbs) in the domain of discourse must be created. For
instance for military history, part would give geographical areas and locations, part would give
the kinds of military ships, and part would give the different kinds of maneuvers a military
ship can engage in. Fig. 5 gives the top of the hierarchy for the military history domain.
Specifically in the Fig. 4 example, "U.S." is a country, "columns" is a kind of military
formation, "Morotai Island" is a place in the western Pacific, "churn" is a side effect of
physical motion in liquids and semi-liquid materials, and "wading" is a locomotion used by
humans in crossing water of only a narrow range of depth. Such information can be obtained from
domain experts using techniques of knowledge acquisition for expert systems. Obtaining all such
information may seem considerable work for the designers of a multimedia database system. But
most of it can come from a natural-language dictionary, and it is necessary anyway for a good
hierarchical indexing scheme on keywords, without which user-friendly access through keywords is
impossible.

The second kind of generalization information we need is the "frame" or "script" abstraction
that frequently occurs in describing often-stereotypical human activities. The terms "ashore,"
"wading," and "waters" in the caption of Fig. 4 together suggest that there is a beach-landing
operation going on, a stereotypical kind of military operation. Certainly, we can create a
hierarchy of military operations that includes a beach landing. But we would not be able to
recognize from the concept hierarchy alone that a beach landing is referenced in this sentence,
since no single word indicates it, only the combination of clues. This kind of recognition is a
"frame" or "script" problem and needs techniques like those in Schank and Abelson. The
abstractions and their clues must be obtained from an expert in the domain. We expect the number
of different such abstractions to be small. For instance, military activities necessary to
explain a World War II database exemplify about ten concepts (see Fig. 7); each has
stereotypical ways of accomplishing it with particular props, and each has associated
preconditions and effects. So when we recognize these stereotypical concepts in meaning lists,
we should insert extra summary terms into the lists, as additional terms to exploit in matching
captions to queries.
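
A very simple form of frame recognition is a table of clue sets, each naming the summary term to
insert when enough clues co-occur. The sketch below is a simplification of frame techniques
under our own assumptions: the rule table, the threshold, and the term `beach_landing` are
hypothetical, and real frames would also carry props, preconditions, and effects.

```python
# Hypothetical frame-recognition rules: (summary term, clue terms, clues required).
FRAME_RULES = [
    ("beach_landing", {"ashore", "wading", "waters"}, 3),
]


def add_frame_terms(terms):
    """Return the caption terms plus a summary term for every frame whose clues are present."""
    result = set(terms)
    for frame, clues, needed in FRAME_RULES:
        if len(clues & result) >= needed:
            result.add(frame)
    return result
```

With these rules, the terms of the Morotai caption gain `beach_landing`, so a query about beach
landings can match even though no single word in the caption names one.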

4.3 Conceptual generalizations: supercaptions 

Our third kind of conceptual generalization seems to be an idea unique with us: the
supercaption, a caption that describes more than one media datum. For instance (see Fig. 6), the
Morotai Island caption in Fig. 4 could be a subcaption for the supercaption "Black/white
photographic record of U.S. in World War II in the Pacific", which in turn could be a subcaption
of the supercaption "Historic black/white photographs of combat". Supercaptions can be obtained
from a domain expert just like captions, and are most useful when they give complex meaning-list
information unobtainable from the concept hierarchy, like the dates, times, and places common to
a set of photos of a battle. Supercaptions can create a hierarchical structure different from
the type hierarchy of domain concepts, as in Fig. 6. Supercaptions can represent how an expert
clusters media data, unlike groupings based on single data features.

"Stub" or "registration" information, about how a set of media objects were created, is
naturally expressed with supercaptions. For instance for a photograph or video, this includes
the photographer, the type of film, the exposure, the date and time the picture was taken, the
place where the picture was taken, and so on. These properties usually apply to classes of
pictures, and would require unfairly tedious labor to enter separately for every picture.
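
Inheritance of such shared information can be sketched with a parent pointer from each caption
to its supercaption. This is an illustration only, assuming that a supercaption's properties
apply to every one of its subcaptions; the property names and the dictionary layout are
hypothetical, while the three-level chain follows the Morotai example above.

```python
# Hypothetical caption store: each entry points to its supercaption, if any.
CAPTIONS = {
    "combat_bw": {"parent": None, "props": {"film": "black/white"}},
    "pacific_ww2": {"parent": "combat_bw",
                    "props": {"war": "World War II", "theater": "Pacific"}},
    "morotai": {"parent": "pacific_ww2", "props": {"place": "Morotai Island"}},
}


def inherited_props(name):
    """Merge properties from the topmost supercaption down to the named caption."""
    chain = []
    while name is not None:
        chain.append(CAPTIONS[name])
        name = CAPTIONS[name]["parent"]
    merged = {}
    for caption in reversed(chain):  # more specific captions override more general ones
        merged.update(caption["props"])
    return merged
```

The film type is stored once, on the topmost supercaption, yet it is available when matching a
query against the Morotai caption.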

Parsing and interpretation of supercaptions involves some issues not addressed with captions.
One is universal quantification: can the supercaption information be appended to each of its
subcaptions? For instance, does each picture in the series of pictures entitled "Morotai Island
actions by U.S." show an event on Morotai, or do some pictures show background, preparations, or
aftermath? The alternative is to treat the supercaption as a "theme" for conceptual clustering.
Another important question is whether the multimedia data referred to by the supercaption
represents a complete enumeration; if so, we can make several powerful inferences. For instance,
the supercaption "The American naval ship types of World War II" implies that every ship type is
shown in at least one media datum, and furthermore every media datum contains at least one ship
type.

Linguists have not devoted attention to this specialized issue, so we have developed our own
heuristics for supercaption semantics. The key in most single-sentence captions seems to be the
nature of the grammatically central noun in the caption, and usually that is the noun in the
subject noun phrase; for instance, "types" in "The American naval ship types of World War II".
Let the variable corresponding to the grammatically central noun be $a$, and a predicate
asserting the truth of the conjunction of all the meaning-list literals linked to it be $p(a)$.
Then:

- Rule 1: If $p(a)$ is a plural-noun equivalent directly depictable in the media data referred
to by the supercaption, or represents a supertype of something depictable, then
$\forall c \in \mathrm{subcaptions}\,(\exists x\; p(x, c))$. This follows from the idea that
each subcaption must advance the "argument" of the supercaption, and if a subcaption referred to
a picture that did not contain the main noun of the supercaption, it would in some sense be
inadequate in supporting the claim of the supercaption. An example is "The American naval ship
types of World War II."

- Rule 2: If the main type of the caption is a single event (events are not "directly
depictable"), then interpret all events in the subcaption as parts of the larger event in the
supercaption. That is,
$p(s) \wedge \forall c \in \mathrm{subcaptions}\,(\forall e \in \mathrm{events}(c)\,[\mathrm{part\_of}(e, s)])$.
For instance, if the supercaption is "Morotai Island actions by U.S." then all verb forms in
subcaptions denote actions that are part of Morotai Island actions by the U.S.

- Rule 3: If the main type of the supercaption is accompanied by the determiner "the" or is
itself a non-picturable type referring to an aggregate (as denoted by the English words
"catalog," "gallery," "display," "index," etc.), followed by the word "of" and a prepositional
phrase, then completeness of the subcaptions in representing the supercaption can be assumed.
That means, taking the noun type in the "of" prepositional phrase as $p(x)$, that
$\forall x\,(p(x) \rightarrow \forall c \in \mathrm{subcaptions}\,(\exists z\,[\mathrm{in}(z, c) \wedge p(z)]))$.
For instance, "The American naval ships of World War II" implies that there exists a caption
pointed to by the supercaption that contains every possible type.

- Rule 4: If none of the preceding rules apply, the supercaption must be interpreted as a theme
invoked only for indexing of supercaptions, and it has no implications for its subcaptions.

All other terms in the meaning list that are linked to the main variable follow similar
quantification to that in the above rules.

5. Retrieval using captions 

Given a query on our multimedia database, we will translate it into a meaning list. Exploiting
the captions for retrieval means first finding captions whose meaning lists match key terms of
the query meaning list (coarse-grain search); then for each candidate whose whole meaning list
matches the query, we retrieve the corresponding media object (fine-grain search). This
two-stage search postpones the handling of the usually bulky media data. To further simplify
matters, we assume the query contains no quantifiers.

There are many ways to use semantic information such as captions for retrieval, not all efficient. The
approach of Kolodner used special-purpose heuristics good for modeling everyday human reasoning but
not necessarily good for technical domains. The approaches of Cohen and Kjeldsen and Smith et al.
explored a semantic network, but only a uniformly structured one (by topic associations in the first, and a
type hierarchy in the second); thus they cannot exploit the full range of knowledge that we do with our
three kinds of conceptual generalization. So Rau's SCISOR is the closest to what we want to do, with its
emphasis on a variety of knowledge for different purposes; it used a two-phase search process like ours.

5.1 Fine-grain search

Our fine-grain search is by definition matching done with the full captions. This inevitably requires a
subgraph-matching algorithm that tries to match pieces of a caption by binding variables and backtracking
as necessary. Subgraph matching is much addressed in computer science, and there are many algorithms
for the many special cases of it. In all algorithms, combinations must be tried until a match is found. In the
worst case, the known general algorithms take time exponential in the size of the input, since the general
subgraph-matching problem is NP-complete. The worst case will not often be approached in real databases
with real user queries, as it requires a few predicate names to be used repeatedly in meaning lists, which is
unlikely considering the human origins of captions and queries.
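
The binding-and-backtracking idea can be illustrated with a minimal sketch. It is not our implementation: it assumes meaning lists are stored as tuples of the form (predicate, argument, ...), that query variables are strings beginning with "?", and the names `unify` and `match` are hypothetical. Unlike strict subgraph matching, it lets two query variables bind to the same caption constant.

```python
from typing import Dict, List, Optional, Tuple

Literal = Tuple[str, ...]      # (predicate, arg1, arg2, ...); an assumed encoding
Bindings = Dict[str, str]


def is_var(term: str) -> bool:
    """Query variables carry a leading '?' (a convention of this sketch only)."""
    return term.startswith("?")


def unify(query_lit: Literal, caption_lit: Literal, env: Bindings) -> Optional[Bindings]:
    """Match one query literal against one caption literal, extending the bindings."""
    if len(query_lit) != len(caption_lit) or query_lit[0] != caption_lit[0]:
        return None
    new_env = dict(env)
    for q, c in zip(query_lit[1:], caption_lit[1:]):
        if is_var(q):
            if new_env.setdefault(q, c) != c:
                return None        # variable is already bound to a different constant
        elif q != c:
            return None            # constants must agree exactly
    return new_env


def match(query: List[Literal], caption: List[Literal],
          env: Optional[Bindings] = None) -> Optional[Bindings]:
    """Depth-first search over query literals, backtracking on failure."""
    env = {} if env is None else env
    if not query:
        return env
    first, rest = query[0], query[1:]
    for lit in caption:
        extended = unify(first, lit, env)
        if extended is not None:
            result = match(rest, caption, extended)
            if result is not None:
                return result
    return None                    # no caption literal worked for `first`


caption = [("soldier", "f2"), ("plural", "f2"), ("wade", "f2"),
           ("water", "h2"), ("churn", "f2", "h2")]
query = [("soldier", "?x"), ("churn", "?x", "?y"), ("water", "?y")]
bindings = match(query, caption)
```

The exponential worst case shows up when many caption literals share a predicate name, since each is then a candidate at every level of the recursion.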

5.2 Coarse-grain search 

The coarse-grain search must map from key terms of the meaning lists to caption pointers. Since we have
one million captions, we will need $\log_2 10^6 \approx 20$ bits for each caption pointer; since we will have
about 50 indexable items per caption based on our examination of good human captions, we will need about 125
megabytes for hash-table pointers alone to the captions. This suggests the pointers be in secondary storage.
Since we expect to use widely scattered portions of the caption access data at any one time, a hashing
scheme is better than an index.
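
The storage estimate follows from simple arithmetic, reproduced here as a short check. The variable names are ours, and a megabyte is taken as $10^6$ bytes.

```python
import math

num_captions = 1_000_000
items_per_caption = 50          # indexable items per caption, as estimated above

# Smallest whole number of bits that can distinguish one million captions.
bits_per_pointer = math.ceil(math.log2(num_captions))

total_bits = num_captions * items_per_caption * bits_per_pointer
total_megabytes = total_bits / 8 / 1_000_000
```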

So we identify key terms in the meaning list translation of a user query, hash these to a secondary-storage
hash table of caption pointers, intersect the pointer lists (we assume by default that a user wants captions
exactly matching the whole query), and look up the corresponding captions. But what are the "key" terms?
After analysis of sample captions, we concluded that only the equivalents of nouns and verbs as they
appear in meaning lists provide sufficiently restrictive information on the set of target data to make them
worthwhile to exploit in a coarse-grain search. Conjunctions, auxiliaries, expletives, and pronouns do not
translate directly into meaning lists. Prepositions and adverbs usually provide only weak restrictive
information and can be fuzzy (for instance, when is one object north of another in a picture?). Some adjectives
like "U.S." are helpful, but usually only non-abstract adjectives; the "alert" in "alert soldiers" contributes
far less. Verbs can be useful, but probably less so than nouns because they are hard to depict in media
data.
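
The hash-and-intersect step can be sketched as follows. This is only an illustration: an in-memory Python dictionary stands in for the secondary-storage hash table, the caption numbers and key terms are invented, and `build_index` and `coarse_search` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_index(captions: Dict[int, List[str]]) -> Dict[str, Set[int]]:
    """Map each key term (noun or verb equivalent) to the captions that use it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for caption_id, key_terms in captions.items():
        for term in key_terms:
            index[term].add(caption_id)
    return index


def coarse_search(index: Dict[str, Set[int]], query_terms: Iterable[str]) -> Set[int]:
    """Intersect the pointer lists of all query key terms, shortest list first."""
    lists = sorted((index.get(t, set()) for t in query_terms), key=len)
    if not lists:
        return set()
    result = set(lists[0])
    for pointers in lists[1:]:
        result &= pointers
        if not result:
            break                  # nothing can survive further intersection
    return result


captions = {
    1: ["soldier", "wade", "churn", "water", "Morotai Island"],
    2: ["soldier", "march"],
    3: ["ship", "water"],
}
index = build_index(captions)
candidates = coarse_search(index, ["soldier", "water"])
```

Only the surviving candidates need to be passed on to the fine-grain search.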

Our hash table gives only exact matches for a query term, since the hash table is necessarily large. For
instance, if a caption mentions Morotai Island, then only the hash table entry for "Morotai Island" points to
it, not the entry for "Western Pacific" or "Battle sites of World War II". So a query that does mention
"Western Pacific" must use the concept hierarchy to reach other hash-table entries to find the Morotai
Island caption. This will save much space at the expense of time to follow the downward pointers of the
concept hierarchy (but significant clustering of these references on hash-table pages can probably be done).
We can also save space by using supercaption pointers in the hash table as well as caption pointers. A
supercaption pointer can represent many subcaption pointers, and the linkage can be specified in another
table.
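
A small sketch may make the two kinds of indirection concrete. It assumes the concept hierarchy is a table from a type to its direct subtypes and the supercaption linkage is a table from a supercaption pointer to its subcaption pointers; the pointer values and function names are invented for illustration.

```python
from typing import Dict, List, Set


def subtype_closure(term: str, children: Dict[str, List[str]]) -> Set[str]:
    """The term itself plus everything reachable by downward hierarchy pointers."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(children.get(t, []))
    return seen


def expand_supercaptions(pointers: Set[int], subcaptions: Dict[int, List[int]]) -> Set[int]:
    """Replace each supercaption pointer by the ordinary captions beneath it."""
    result: Set[int] = set()
    seen: Set[int] = set()
    stack = list(pointers)
    while stack:
        p = stack.pop()
        if p in seen:
            continue
        seen.add(p)
        if p in subcaptions:
            stack.extend(subcaptions[p])   # a supercaption: descend
        else:
            result.add(p)                  # an ordinary caption pointer
    return result


def lookup(term: str, index: Dict[str, Set[int]],
           children: Dict[str, List[str]],
           subcaptions: Dict[int, List[int]]) -> Set[int]:
    pointers: Set[int] = set()
    for t in subtype_closure(term, children):
        pointers |= index.get(t, set())
    return expand_supercaptions(pointers, subcaptions)


children = {"Western Pacific": ["Morotai Island"]}        # concept hierarchy
index = {"Morotai Island": {7}, "Western Pacific": {900}}  # exact-match hash table
subcaptions = {900: [8, 9]}                                # 900 is a supercaption
found = lookup("Western Pacific", index, children, subcaptions)
```

The hash table itself stays small because "Western Pacific" stores one supercaption pointer rather than a pointer to every caption about any place in the region.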

Although our prototype implementation was for a single processor, the coarse-grain search can use
concurrent processing for the conjunctive portions of queries, where each processor writes to a shared memory
of candidate caption pointers. We intend to do this on a network of Sparc workstations. Initially, each key
term in the query meaning list can be assigned a separate processor with its own list of caption pointers it
has found so far in its designated area of the shared memory. Each processor can use the concept hierarchy
to find subtypes of its term in the concept hierarchy, and the supercaption-subcaption table to find
subcaptions. Whenever a processor exhausts all possibilities for its pointers, it goes through the pointer lists
generated by the other processors and (1) eliminates all those that do not appear in its own list, and (2)
eliminates from its own list all pointers not appearing in lists of other exhausted processors. The first processor
to finish will tend to be the one finding the fewest pointers and hence having the most restrictive terms, and
this processor will eliminate possibilities first, the most efficient way of doing a set intersection. Note that
this approach permits the first few media datums found to be supplied to the user while processing
continues to find others: this can keep the user happy during a long search.
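
The flavor of this scheme can be shown with a simplified sketch using threads in place of separate workstations. It departs from the description above in one respect, which should be kept in mind: instead of processors pruning one another's lists in shared memory, a single coordinator intersects each pointer list as its worker finishes. The function name and the sample index are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Optional, Set


def concurrent_coarse_search(terms: Iterable[str],
                             lookup: Callable[[str], Set[int]]) -> Set[int]:
    """Run one lookup per key term and intersect results in completion order."""
    terms = list(terms)
    if not terms:
        return set()
    candidates: Optional[Set[int]] = None
    with ThreadPoolExecutor(max_workers=len(terms)) as pool:
        futures = [pool.submit(lookup, term) for term in terms]
        for future in as_completed(futures):
            pointers = future.result()
            # Workers with short lists tend to finish first, so the running
            # intersection shrinks early, as in the scheme described above.
            candidates = set(pointers) if candidates is None else candidates & pointers
    return candidates if candidates is not None else set()


sample_index = {
    "photos": {1, 2, 3, 4},
    "landings": {2, 3, 9},
    "Philippines campaign": {3, 4, 9},
}
result = concurrent_coarse_search(
    ["photos", "landings", "Philippines campaign"],
    lambda term: sample_index.get(term, set()),
)
```

Because set intersection is commutative, the answer does not depend on which worker finishes first; only the amount of intermediate work does.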

Fig. 8 shows an example of concurrent coarse-grain search. An English question is parsed and interpreted
to create a meaning list. At the same time frame recognition rules infer that a beach-landing frame is
applicable, and add a term for it to the meaning list. Three key terms in the meaning list are each assigned
a separate process to look up caption pointers: "photos", "landings", and "Philippines campaign". We also
establish a process for the beach-operation frame. Now "photos" has subtypes of military and civilian
photos in the concept hierarchy. So we can establish separate processes for these to find captions that
reference them explicitly; and for all subtypes of these subtypes; and so on. "Philippines campaign" is a
term likely to be in a supercaption; so one pointer for it in the hash table could be to a supercaption in the
caption database (as we could quickly identify if supercaptions had a designated range of pointer codes).
Then we could establish separate processes to find the subcaptions of the supercaption, while still trying to
find direct pointers to "Philippines campaign." Here the subcaptions would be for the various battles
involved in that campaign; we could explore them and their subevents, returning all caption pointers
encountered as we find them.

5.3 Further details of the coarse-grain search 

The only detail in Fig. 8 as yet unexplained is the relation of beach operations to amphibious actions. This
is an example of our alias handling, cross-referencing from one equivalent term to another. Aliases are
common in natural language and are important to the user-friendliness of text-based interfaces. For
instance, "plane", "airplane", and "aircraft" mean the same thing. Most of this can be handled in the
lexicon by assigning the same literals to represent the meaning of the aliases. But when aliases are near but not
exact, like "beach operation" and "amphibious action", it makes more sense to postpone their handling to
the coarse-grain search when they can serve as search heuristics. We can designate one alias as primary,
and store pointers with it: all other aliases can just have a special flag and a pointer to the primary alias.
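
One possible layout for such entries is sketched below. The `Entry` record and `pointers_for` function are hypothetical; the `primary` field plays the role of both the special flag and the pointer to the primary alias, and the caption pointer values are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Entry:
    pointers: Set[int] = field(default_factory=set)
    primary: Optional[str] = None   # set only on non-primary aliases


def pointers_for(term: str, table: Dict[str, Entry]) -> Set[int]:
    """Follow alias links until the primary entry is reached."""
    visited: Set[str] = set()
    while term in table and table[term].primary is not None:
        if term in visited:
            raise ValueError("alias cycle at " + term)
        visited.add(term)
        term = table[term].primary
    return table[term].pointers if term in table else set()


table = {
    "amphibious action": Entry(pointers={4, 11}),          # primary alias
    "beach operation": Entry(primary="amphibious action"),  # near alias
}
beach_pointers = pointers_for("beach operation", table)
```

Storing the pointer list once, under the primary alias, avoids duplicating it for every near-synonym.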

We expect that negated terms will be rare in captions, since the point of a caption is to describe presences,
not absences. But negatives can occur in queries, as for instance "Non-U.S. soldiers in the Philippines
campaign." We can retrieve the pointers of the negated term with a separate processor just as before, but
now we eliminate pointers in other processors' lists that do occur in the pointer list for this negation
processor. Also, non-negation processors should delete pointers from their own lists that occur in the list of any
negation processor, even if the negation processors are not done.

If we write the query as a conjunction of disjunctive expressions, the algorithm of the last section can be
applied separately to each item in the conjunction. Then the disjunctions can be treated just like the
subtypes and subcaptions, which are implicit disjunctions. Disjunctions in captions should be rejected as too
vague to be a good description. Again, we assume no quantifiers in queries.
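
In set terms, negation is a difference and disjunction is a union, as the following sequential sketch shows. The representation of a query as a list of clauses plus a list of negated terms, the function name `evaluate`, and the sample pointer lists are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set


def evaluate(clauses: List[List[str]], negated: List[str],
             index: Dict[str, Set[int]]) -> Set[int]:
    """Intersect the unions of each disjunctive clause, then remove negated terms."""
    result: Optional[Set[int]] = None
    for clause in clauses:
        union: Set[int] = set()
        for term in clause:                 # a disjunction is a union of lists
            union |= index.get(term, set())
        result = union if result is None else result & union
    if result is None:
        return set()
    for term in negated:                    # a negation eliminates pointers
        result -= index.get(term, set())
    return result


index = {
    "soldier": {1, 2, 3},
    "sailor": {5},
    "Philippines campaign": {1, 2, 5},
    "U.S.": {1},
}
# "Non-U.S. soldiers in the Philippines campaign"
non_us = evaluate([["soldier"], ["Philippines campaign"]], ["U.S."], index)
# The same query widened to "soldiers or sailors"
widened = evaluate([["soldier", "sailor"], ["Philippines campaign"]], ["U.S."], index)
```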

The concept hierarchy of section 4.2 is an "a_kind_of" or "generalization/specialization" hierarchy, and the
coarse-grain search algorithm exploits the predominantly downward nature of inheritance with respect to
these links. However, several other kinds of inheritance can also occur, as discussed in Rowe, and can be
exploited by a smarter algorithm. One classic example is with the "part_of" or "containment" relationship
between concepts. For instance, if a query asks for pictures of planes with ceramic-composite wings, that
should match a caption describing a ceramic-composite plane, since a wing is part of a plane. This kind of
inference won't work at all for certain properties (like cost) and works in the opposite direction for other
properties (like defectiveness of a part, which inherits upwards to give defectiveness of a plane containing
the part). A rule-based inference system is necessary to specify all the cases.

5.4 Time efficiency of our approach 

To show that our media data search is efficient in its use of time, we must compare it to other methods of
information retrieval. To be fair, we cannot compare it to the methods that store media data in main
memory or secondary storage since the total amount of media data we want to store is too large; the
"spreading activation" idea used in that work could require enormous numbers of optical-jukebox disk
fetches, since significant clustering of access pointers is hard to achieve. So the best comparison is to EP-X
with its media-data pointers embedded in a type hierarchy of keywords. For EP-X, fine-grain search must
be done by the user, so unnecessary extra media data is retrieved compared to our approach. On the other
hand, our approach requires that all queries go through a new secondary-storage structure, the captions
database. Since we are talking about slow secondary and tertiary storage, and algorithms requiring little
main-memory processing (except perhaps for parsing, which preliminary experiments convince us can be
done in at worst a few seconds), page access time will greatly override all other time costs. Let $c_s$ be the
cost of secondary storage page fetches for the captions, $c_t$ the cost of media data page fetches, $n$ the
number of media datum pointers produced by EP-X or the number of caption pointers we produce, and $p$
the probability that a media datum pointer in EP-X or a caption pointer in our system will satisfy the
fine-grain search criteria. Assume all other secondary storage page fetches are negligible in cost
(concept-hierarchy and supercaption-hierarchy pointers will show a high degree of clustering, and their page fetches
can be done concurrently with the caption and media-data fetches). Then our approach will be better than
EP-X if $n c_t > n c_s + n p c_t$, that is, when $p < 1 - (c_s/c_t)$. (Actually, we are being conservative in
assuming that the same number of caption pages and media object pages will be needed; otherwise the $n$ on
the left side must be increased.) In our system currently under development, we estimate the paging cost
ratio $c_s/c_t$ will be about 0.1 based on claimed times of the hardware we are using (18 msec seek for magnetic
disk, 90 msec seek for optical disk, 10 seconds for exchanging disks in the jukebox) and the assumption that
enough clustering of media data references on optical disks can be done so that exchanging disks is necessary
only once in about 100 page fetches. So the fine-grain search need only exclude one caption in ten in order
that our caption-based approach be faster; thus fine-grain search does not have to rule out much in order that
our approach be better. At the same time, our approach will be more user-friendly since the user can work in
natural language.
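
The figure of about 0.1 can be reproduced with a short calculation. The sketch assumes, as one reading of the hardware times above, that $c_s$ is the magnetic-disk seek time and that $c_t$ is the optical-disk seek time plus the disk-exchange time amortized over 100 fetches; the function names are ours.

```python
def optical_fetch_ms(seek_ms: float = 90.0, exchange_ms: float = 10_000.0,
                     fetches_per_exchange: int = 100) -> float:
    """Average optical page-fetch cost with the disk exchange amortized."""
    return seek_ms + exchange_ms / fetches_per_exchange


def caption_approach_is_faster(p: float, c_s: float, c_t: float) -> bool:
    """True when n*c_t > n*c_s + n*p*c_t, i.e. p < 1 - c_s/c_t (n cancels)."""
    return p < 1.0 - c_s / c_t


c_s = 18.0                       # magnetic-disk seek, msec
c_t = optical_fetch_ms()         # 90 + 10000/100 = 190 msec
ratio = c_s / c_t                # roughly 0.095, i.e. about 0.1
threshold = 1.0 - ratio          # captions win if p is below this
```

With these numbers the caption database pays for itself as long as fewer than about nine in ten candidate captions survive the fine-grain search.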

5.5 Partial matching to a query 

A common user error is putting so many restrictions in a query that its answer set is empty. With our
caption-based approach, this circumstance can be identified without going to the multimedia database, at
worst in the fine-grain search, or at best in the coarse-grain search without going to the caption database
either. When this happens, it is helpful for the system to automatically try partial matching, finding
captions that satisfy some generalization of the query. Three modifications of our processing algorithm make
this not difficult to do. First, we can find pointers that occur in all but at most K of the pointer lists
intersected, the lists corresponding to the key query terms. Second, we can search upward in the concept
hierarchy as well as downward: to supertypes of terms, or to supercaptions of captions. Third, we can
follow less exact aliases of terms in the concept hierarchy.

All three ideas are quantifiable, and they trade off with one another, so an A* search is strongly suggested
to find the best "near miss" media data. Then the cost used in the A* search can be the sum of (1) the
number of (previously intersected) pointer lists in which the term does not occur; (2) $\log_2$ of the ratio of the
generalization set size to the starting set size (set sizes being determined by counting corresponding
multimedia objects in advance); and (3) $-\log_2$ of the subjective probability that an item satisfying the alias term
will satisfy a user requesting the original term. The weighting of these three cost factors will need to be
determined by trial and error; analogous weighting problems arise frequently in information retrieval and
many methods have been developed for them.
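
The cost of one candidate relaxation can be written down directly from the three factors. The sketch below covers only this cost term, not the A* search itself; the function name and the unit default weights are placeholders, since the weights are to be found by trial and error.

```python
import math
from typing import Tuple


def relaxation_cost(missing_lists: int,
                    generalized_size: int,
                    starting_size: int,
                    alias_probability: float,
                    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three partial-matching cost factors."""
    w_missing, w_general, w_alias = weights
    return (w_missing * missing_lists
            + w_general * math.log2(generalized_size / starting_size)
            - w_alias * math.log2(alias_probability))


exact = relaxation_cost(0, 10, 10, 1.0)      # no relaxation at all
relaxed = relaxation_cost(1, 40, 10, 0.5)    # 1 + log2(4) - log2(0.5)
```

An exact match costs nothing, and each kind of loosening adds a nonnegative amount, so cheaper relaxations are explored first.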

6. Customization for the user

There are many opportunities for optimization to the needs of a particular user in our system. Lexicon,
caption-pointer, caption, and media-data pages can all be cached with a least-recently-used replacement
policy. Hence, we should place the most closely related items together on pages wherever possible. It may
also be good to cache results of caption-pointer intersections, which amounts to caching of structures rather
than keywords, a high-level form of caching. User customization of the natural-language processing is not
as important, but information as to particular word senses of ambiguous words that the user employs can be
stored.
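
Caching of intersection results can be illustrated with the least-recently-used cache in the Python standard library. This is a sketch only: the in-memory `INDEX`, its contents, and the cache size are assumptions, and a real system would be caching pages fetched from secondary storage.

```python
from functools import lru_cache
from typing import Dict, FrozenSet

# Hypothetical pointer lists; frozensets so that results are safe to share.
INDEX: Dict[str, FrozenSet[int]] = {
    "soldier": frozenset({1, 2}),
    "water": frozenset({1, 3}),
}


@lru_cache(maxsize=256)            # least-recently-used replacement
def cached_intersection(terms: FrozenSet[str]) -> FrozenSet[int]:
    """Intersect the pointer lists for a set of key terms, remembering the result."""
    pointer_lists = [INDEX.get(term, frozenset()) for term in terms]
    if not pointer_lists:
        return frozenset()
    return frozenset.intersection(*pointer_lists)


first = cached_intersection(frozenset({"soldier", "water"}))
second = cached_intersection(frozenset({"water", "soldier"}))   # served from cache
```

Keying the cache on the set of terms, rather than on single keywords, is what makes this a cache of structures.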

7. Conclusion 

Captions are a natural way to organize multimedia data. But using captions in a significant way in an
automated retrieval system is a difficult problem which requires conceptual innovations as well as the sort
of significant effort we have described for our project, which we believe is the first frontal assault on
caption-based data retrieval. Much work remains to be done, but we are confident that we now have a design
that can work.
Caption: US soldiers wading ashore in columns churn up the waters off Morotai Island.

Parse tree (a summarization of program procedure calls):

sentence
  (nounphrase
     (nounphrase
        (nounphrase(adjectivelist(adjective("US")),noun("soldiers")),
         participlephrase
           (participle("wading"),adverb("ashore")) ),
      prepositionalphrase(preposition("in"),noun("columns")) ),
   verbphrase
     (verbgroup(verb("churn"),particle("up")),
      nounphrase
        (nounphrase(determiner("the"),noun("waters")),
         prepositionalphrase
           (preposition("off"),
            propernoun("Morotai Island")) ) ) )

Meaning list (actual program output):
[plural(f2), soldier(f2), name(f2,U.S.), place(f2),
wade(f2), action(wade,g2), tense(g2,present), transitive(g2),
place(f2,0), inside(f2,i2), plural(i2), column(i2),
churn(f2,h2), action(churn,d2), plural(d2), tense(d2,present), direction(f2,0),
plural(h2), water(h2), definite(h2),
location(h2,l2), name(l2,Morotai Island), place(l2)]

Frame inferred: beach-landing

Example meaning terms inheritable from supercaptions:
[photograph(a3), focus(a3,medium-range), colorrange(a3,blackwhite),
war(a3,"World War II"), area(a3,"Pacific Ocean"),
campaign(a3,"Philippines recapture")]

Figure 4: An example parse tree and corresponding meaning list obtained by our parsing and
interpretation program, plus examples of additional information inferrable or inheritable

1. physical objects
   1.1. geographical locations
      1.1.1. land units
      1.1.2. water
      1.1.3. air
      1.1.4. mixed units
   1.2. vehicles
      1.2.1. land
      1.2.2. water
      1.2.3. air
   1.3. weapons
   1.4. other military equipment
   1.5. people
      1.5.1. military
      1.5.2. civilian
   1.6. organizations
      1.6.1. military
      1.6.2. civilian
   1.7. terrain
   1.8. weather
2. abstract objects
   2.1. facts
      2.1.1. observations
      2.1.2. measurements
      2.1.3. thoughts
   2.2. events
   2.3. plans
   2.4. directions
   2.5. communications networks
   2.6. responsibility
   2.7. military actions
      2.7.1. aggression
      2.7.2. defense
      2.7.3. preparation

Figure 5: Top levels of the concept hierarchy for a multimedia database of World War II military
history






[Diagram: supercaption nodes including "Island landings during wars", "World War II, 1939-1945",
"World War II in the Pacific", and "Morotai Island", leading down to the caption "U.S. soldiers wading
ashore in columns churn up the waters off Morotai Island"]

Figure 6: An example supercaption hierarchy

Plan: To find a sequence of steps that will achieve a goal
Order: To command someone to do something
Secure: To achieve a goal
Attack: Aggression from one entity on another
Defend: To act to minimize aggression by another
Attempt: To try to perform some action
Maneuver: To move or steer an object through air, sea, or land
Neutralize: To make a strategic asset unimportant
Disguise: To make a strategic object more difficult to recognize
Fortify: To make a strategic object more difficult to aggress upon

Figure 7: The ten basic military-history frames we use

Figure 8: An example of processes created in a coarse-grain search with concurrency